Motivation

For this Tidy Tuesday, I would like to learn how to plot the Pizza Ratings data on a map. I need to build a map to display my water quality data in Florida, so I will focus here on visualizing the Pizza Ratings as a jumping-off point for future map-making.

To begin, I have been referencing a few resources for map-building, including:

  1. R Graph Gallery, particularly the post on Bubble maps with ggplot2
  2. R vignettes
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(here)
## here() starts at C:/Users/13216/Desktop/MADA/Module 9 Modeling/meganlott-tidytuesday1
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(mapproj)
## Loading required package: maps
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(ggfortify)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(dplyr)
library(viridis) #color palette
## Loading required package: viridisLite
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggmap':
## 
##     wind
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
pizza_jared <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_jared.csv")
## Parsed with column specification:
## cols(
##   polla_qid = col_double(),
##   answer = col_character(),
##   votes = col_double(),
##   pollq_id = col_double(),
##   question = col_character(),
##   place = col_character(),
##   time = col_double(),
##   total_votes = col_double(),
##   percent = col_double()
## )
pizza_barstool <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_barstool.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   name = col_character(),
##   address1 = col_character(),
##   city = col_character(),
##   country = col_character()
## )
## See spec(...) for full column specifications.
pizza_datafiniti <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_datafiniti.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   address = col_character(),
##   city = col_character(),
##   country = col_character(),
##   province = col_character(),
##   latitude = col_double(),
##   longitude = col_double(),
##   categories = col_character(),
##   price_range_min = col_double(),
##   price_range_max = col_double()
## )
pizza_raw <- read_csv("https://raw.githubusercontent.com/tylerjrichards/Barstool_Pizza/master/pizza_data.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `_distance` = col_double(),
##   createdAt = col_datetime(format = ""),
##   deleted = col_logical(),
##   featuredMedia.__t = col_logical(),
##   featuredMedia.__v = col_logical(),
##   featuredMedia._id = col_logical(),
##   featuredMedia.createdAt = col_datetime(format = ""),
##   featuredMedia.deleted = col_logical(),
##   featuredMedia.metadata.duration = col_double(),
##   featuredMedia.metadata.frameRate = col_double(),
##   featuredMedia.modifiedAt = col_datetime(format = ""),
##   modifiedAt = col_datetime(format = ""),
##   orderProvider = col_logical(),
##   orderProvider.internalId = col_double(),
##   phoneNumber = col_double(),
##   priceLevel = col_double(),
##   providerRating = col_double(),
##   providerReviewCount = col_double(),
##   refreshDate = col_datetime(format = ""),
##   reviewStats.all.averageScore = col_double()
##   # ... with 18 more columns
## )
## See spec(...) for full column specifications.

The cleaning script provided here has some bugs, so I skipped it and used my own.

#we are cleaning this data to make it suitable for use in ggplot and other mapping functions
#we want the lat/long column names to match the format used in ggplot
names(pizza_barstool)[6] <- "lat" 
names(pizza_barstool)[7] <- "long" 

#somewhere in this script, we need to filter out the pizza shops without long/lat data (there are 2)
#we may want to use a unique() function to filter out multiples of the same pizza shop - do this by address, perhaps, and not by name, in case there are shops with the same name 

pizza_cleaned = pizza_barstool %>%
  filter(!is.na(long)) 
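The address-based de-duplication mentioned above can be sketched with dplyr::distinct() (a sketch, assuming address1 uniquely identifies a physical location; .keep_all = TRUE retains the full first row for each address):

```r
#keep only the first row for each unique street address, so shops that share
#a name but sit at different addresses are preserved
pizza_deduped = pizza_cleaned %>%
  distinct(address1, .keep_all = TRUE)
```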

Side question - are there shops here with the same name?

distinct(pizza_barstool, name)
## # A tibble: 452 x 1
##    name                      
##    <chr>                     
##  1 Pugsley's Pizza           
##  2 Williamsburg Pizza        
##  3 99 Cent Fresh Pizza       
##  4 Nino's 46                 
##  5 La Pizza Fresca Ristorante
##  6 La Gusto Pizza            
##  7 Cheesy Pizza              
##  8 Sal & Carmine's Pizza     
##  9 MAMA'S TOO!               
## 10 Bond 45                   
## # ... with 442 more rows
#yes, there are 11 shops that have duplicated names. Which shops are these?


pizza_chains = pizza_barstool %>% 
  group_by(name) %>% 
  filter(n() > 1) %>% 
  summarize(count = n())

There are 452 pizza shops with unique names. This means there are 11 shops that share a name with another shop. What names have been re-used? Are these duplicates or chains?
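One way to answer the duplicates-or-chains question is to keep every row whose name occurs more than once and compare addresses (a sketch: the same address suggests a duplicated entry, different addresses suggest a small chain):

```r
#all rows whose shop name appears more than once, with location columns
#kept so duplicate entries can be told apart from chains
duplicated_shops = pizza_barstool %>% 
  group_by(name) %>% 
  filter(n() > 1) %>% 
  ungroup() %>%
  select(name, address1, city) %>%
  arrange(name)
```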

Now, let’s look and see where these pizza shops are located.

#Write some code here that plots the data by the number of pizza shops per city.
pizza_shops_city = pizza_barstool %>% 
  group_by(city) %>%
  summarize(num_per_city = n()) %>%
  arrange(desc(num_per_city))

pizza_shops_city
## # A tibble: 99 x 2
##    city        num_per_city
##    <chr>              <int>
##  1 New York             251
##  2 Brooklyn              20
##  3 Boston                13
##  4 Las Vegas             11
##  5 Minneapolis            8
##  6 Atlanta                6
##  7 Chicago                6
##  8 Columbus               6
##  9 Hoboken                6
## 10 Nantucket              6
## # ... with 89 more rows

Top Ten Density: The cities with the most pizza shops included in this study were: New York, Brooklyn, Boston, Las Vegas, Minneapolis, Atlanta, Chicago, Columbus, Hoboken, and Nantucket.

As a NY Native, Joe says that John’s on 44th has the best pizza. Looking through the shops in NYC, he must be referring to John’s of Times Square. How does this compare to the average pizza ratings in NYC?

pizza_barstool %>% 
  filter(city == "New York") %>%
  arrange(name)
## # A tibble: 251 x 22
##    name  address1 city    zip country   lat  long price_level
##    <chr> <chr>    <chr> <dbl> <chr>   <dbl> <dbl>       <dbl>
##  1 10th~ 256 10t~ New ~ 10001 US       40.7 -74.0           1
##  2 3 Gi~ 548 Lag~ New ~ 10012 US       40.7 -74.0           2
##  3 310 ~ 310 Bow~ New ~ 10012 US       40.7 -74.0           2
##  4 42nd~ 647 W 4~ New ~ 10036 US       40.8 -74.0           2
##  5 5 Bo~ 386 Can~ New ~ 10013 US       40.7 -74.0           1
##  6 99 C~ 473 Lex~ New ~ 10017 US       40.8 -74.0           1
##  7 A Sl~ 727 8th~ New ~ 10036 US       40.8 -74.0           2
##  8 Ador~ 287 Hud~ New ~ 10013 US       40.7 -74.0           2
##  9 Adri~ 54 Ston~ New ~ 10004 US       40.7 -74.0           2
## 10 Alph~ 525 Gra~ New ~ 10002 US       40.7 -74.0           1
## # ... with 241 more rows, and 14 more variables: provider_rating <dbl>,
## #   provider_review_count <dbl>, review_stats_all_average_score <dbl>,
## #   review_stats_all_count <dbl>, review_stats_all_total_score <dbl>,
## #   review_stats_community_average_score <dbl>,
## #   review_stats_community_count <dbl>,
## #   review_stats_community_total_score <dbl>,
## #   review_stats_critic_average_score <dbl>,
## #   review_stats_critic_count <dbl>,
## #   review_stats_critic_total_score <dbl>,
## #   review_stats_dave_average_score <dbl>, review_stats_dave_count <dbl>,
## #   review_stats_dave_total_score <dbl>
johns_44 = pizza_barstool %>% 
  filter(name == "John's of Times Square") %>%
  summarize(score = mean(review_stats_all_average_score))

all_nyc = pizza_barstool %>%
  filter(city == "New York") %>%
  summarize(score = mean(review_stats_all_average_score))

#tibbles do not support row names, so label the rows with a column instead
johns_compared = bind_rows("John's of Times Square" = johns_44, 
                           "NYC Average" = all_nyc, 
                           .id = "shop")
#Hmmm...would this code work if we were joining more complex data?

#would like to add a visual. Would it help if we compares John's to the best of NYC? And the worst? What about the median? Or a t-test?
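As one possible visual (a sketch reusing johns_44 from above): summarize the best, worst, and median NYC scores, then mark John's score on a histogram of the NYC distribution.

```r
#context for John's score: best, worst, and median all-reviewer averages in NYC
nyc_summary = pizza_barstool %>%
  filter(city == "New York") %>%
  summarize(best = max(review_stats_all_average_score),
            worst = min(review_stats_all_average_score),
            median = median(review_stats_all_average_score))

#histogram of NYC scores with John's of Times Square marked in red
pizza_barstool %>%
  filter(city == "New York") %>%
  ggplot(aes(x = review_stats_all_average_score)) +
  geom_histogram(binwidth = 0.5) +
  geom_vline(xintercept = johns_44$score, color = "red")
```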

Bubble Map

As input you need:

  1. a list of GPS coordinates (longitude and latitude of the places you want to represent). CHECK! We will use the long and lat data in the pizza_cleaned df.
  2. a numeric variable used for bubble color and size. CHECK! We will use the review_stats_all_average_score value from the pizza_cleaned df.

We’re going to make a basic Bubble Map here to show where the pizza data has been collected.

#add geom_polygon for the shape of the US, and add your scatterplot on it with geom_point() to map the locations of the pizza shops rated in this data.

map_us = map_data("usa")
head(map_us)
##        long      lat group order region subregion
## 1 -101.4078 29.74224     1     1   main      <NA>
## 2 -101.3906 29.74224     1     2   main      <NA>
## 3 -101.3620 29.65056     1     3   main      <NA>
## 4 -101.3505 29.63911     1     4   main      <NA>
## 5 -101.3219 29.63338     1     5   main      <NA>
## 6 -101.3047 29.64484     1     6   main      <NA>
names(map_us)
## [1] "long"      "lat"       "group"     "order"     "region"    "subregion"
class(map_us)
## [1] "data.frame"
ggplot() +
  geom_polygon(data = map_us, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
  geom_point(data=pizza_barstool, aes(x=long, y=lat)) +
  theme_void() + coord_map() 
## Warning: Removed 2 rows containing missing values (geom_point).

#now, let's add in some information about the pizza data using color. 
#Ideally, we want to visualize the best pizza places. Let's fill in the figure by pizza rating per shop.
ggplot() +
  geom_polygon(data = map_us, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
  geom_point(data=pizza_cleaned, aes(x=long, y=lat, color = review_stats_all_average_score)) +
  theme_void() + coord_map() 

#we can make this an interactive map using ggplotly
p = ggplot() +
  geom_polygon(data = map_us, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
  geom_point(data=pizza_cleaned, aes(x=long, y=lat, color = review_stats_all_average_score)) +
  theme_void() + coord_map() 

ggplotly(p)

We examined the pizza ratings for the US. Now, can we zoom in on a location? Let's choose NYC, the city with the most pizza shops included in the study.

(Coordinate note: Florida spans roughly 25-31°N and 88-80°W; IRL, 27-29°N and 80-81°W.)

ny = map_data("state", region = "new york")

ny_pizza = pizza_barstool %>% filter(city == "New York", !is.na(long)) 

ny_pizza_plot = ggplot() +
  geom_polygon(data = ny, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) +
  geom_point(data=ny_pizza, aes(x=long, y=lat, color = review_stats_all_average_score)) +
  theme_void() + coord_map() 

ggplotly(ny_pizza_plot)
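On the full state polygon, the city points bunch into one corner. A sketch of a tighter view, clipping the projection with coord_map() to an approximate NYC bounding box (roughly 40.5-40.9°N, 74.3-73.7°W; these limits are my rough guesses, not taken from the data):

```r
ggplot() +
  geom_polygon(data = ny, aes(x = long, y = lat, group = group), fill = "grey", alpha = 0.3) +
  geom_point(data = ny_pizza, aes(x = long, y = lat, color = review_stats_all_average_score)) +
  theme_void() + 
  #clip to an approximate NYC bounding box (west longitudes are negative)
  coord_map(xlim = c(-74.3, -73.7), ylim = c(40.5, 40.9))
```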

Choropleth Map

Building a choropleth map

A choropleth map uses the color of each region in the plot to display some value associated with the region.

The maps package is a great source of geospatial data that is readily available in R. ggplot2 provides the function map_data(), which fetches maps from the maps package and returns them in a format that ggplot2 can plot.

To make this map, we need to define region boundaries, which are typically stored in geospatial objects. These can be accessed as shapefiles, geojson files, or provided in an R package.

Below, we practice making the map of Florida from the R map data. A short list of the datasets saved in maps includes: france, italy, nz, usa, world, and world2, along with county and state.

fl = map_data("state", region = "florida")
ggplot(fl) +
  geom_polygon(mapping = aes(x = long, y = lat)) + 
  coord_map() 

#let's take a look at the usa map data in R
map_us = map_data("usa")
head(map_us)
##        long      lat group order region subregion
## 1 -101.4078 29.74224     1     1   main      <NA>
## 2 -101.3906 29.74224     1     2   main      <NA>
## 3 -101.3620 29.65056     1     3   main      <NA>
## 4 -101.3505 29.63911     1     4   main      <NA>
## 5 -101.3219 29.63338     1     5   main      <NA>
## 6 -101.3047 29.64484     1     6   main      <NA>
names(map_us)
## [1] "long"      "lat"       "group"     "order"     "region"    "subregion"
class(map_us)
## [1] "data.frame"
#the map data includes long and lat data for the mainland (continental) US
#using the map_data function saves the map data as a data frame

ggplot(map_us, aes(x = long, y = lat, group = group)) +
  geom_polygon() + coord_map()

#using the ggfortify package, the map() function saves the data as a map

map_us = map("usa", plot = FALSE)
names(map_us)
## [1] "x"     "y"     "range" "names"
#we can use the autoplot function in ggfortify to control the aesthetics 
map_us = autoplot(map_us)
class(map_us)
## [1] "gg"     "ggplot"
pizza_barstool %>%
  filter(city == "New York") %>%
  ggplot() + 
  geom_point(aes(x = review_stats_all_average_score, y = price_level))

Building a choropleth map will require some data wrangling. We want to ensure that a common set of region names appears across both data sets.

Initialize a plot with the data set that contains your data (in the tutorial this is USArrests2).

Add geom_map(). Set the map_id aesthetic to the variable that contains the regions names. Then set the fill aesthetic to the fill variable. You do not need to supply x and y aesthetics, geom_map() will derive these values from the map data set, which you must set with the map parameter. Since map is a parameter, it should go outside the aes() function.

Follow geom_map() with expand_limits(), and tell expand_limits() what the x and y variables in the map dataset are. This shouldn't be necessary in future iterations of geom_map(), but for now ggplot2 will use the x and y arguments of expand_limits() to build the bounding box for your plot.

Avoid distortions by adding coord_map() to your plot. coord_map() displays the plot in a fixed cartographic projection. You must first load the mapproj package.
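The steps above can be sketched with the built-in USArrests data (standing in for the USArrests2 set mentioned earlier; the state names are lower-cased so they match the region names returned by map_data("state")):

```r
#wrangle: build a common set of region names across the data and the map
us_states = map_data("state")
arrests = USArrests
arrests$region = tolower(rownames(USArrests))

#geom_map() matches rows to regions via map_id; expand_limits() supplies the
#bounding box, and coord_map() applies a fixed cartographic projection
ggplot(arrests) +
  geom_map(aes(map_id = region, fill = Murder), map = us_states) +
  expand_limits(x = us_states$long, y = us_states$lat) +
  coord_map()
```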

Additional Notes